SUMMARY


The  use  of  the  finite  element  and  finite  difference  methods  often  leads  to 
the  problem  of  solving  large,  sparse,  positive  definite  systems  of  linear  equa- 
tions. Recently  the  one-way  dissection  and  nested  dissection  algorithms  have 
been  developed  for  solving  such  systems.  Concurrently,  vector  computers  (com- 
puters with  hardware  instructions  that  accept  vectors  as  operands)  have  been 
developed  for  large  scientific  applications.  In  reference  1,  George,  Poole  and 
Voigt  analyzed  the  use  of  dissection  algorithms  on  vector  computers.  In  that 
paper,  MACSYMA  played  a major  role  in  the  generation  of  formulas  representing 
the  time  required  for  execution  of  the  dissection  algorithms.  In  the  present 
paper  the  author  describes  the  use  of  MACSYMA  in  the  generation  of  those 
formulas. 


DISSECTION  ALGORITHMS 


When finite difference or finite element methods are used for approximating
solutions of partial differential equations, it is often the case that a
large, sparse, positive definite system of linear equations

Ax = b  (1)

must be solved. We shall assume that the domain over which the differential
equation is defined is a square region covered by an n by n grid consisting
of (n-1)^2 small squares called elements. It follows that A is an
n^2 by n^2 matrix. The ordering of the unknowns at the grid points determines
the  location  of  the  nonzero  components  of  A and,  consequently,  the  storage 
and  time  required  to  solve  the  linear  system  by  Gauss  elimination. 


An  ordering  of  the  unknowns  called  one-way  dissection  is  due  to  George 
(see  ref.  2).  Referring  to  figure  1,  the  idea  of  one-way  dissection  is  first 
to divide the grid with m horizontal separators. The unknowns in the m+1
remaining rectangles are numbered vertically toward a separator and then the
separator nodes are numbered. The problem is to derive formulas for storage
and  timing  requirements  and  to  minimize  those  formulas  with  respect  to  m (see 
ref.  2). 

The  second  dissection  scheme  is  called  nested  dissection  (again,  see  ref. 
2)  and  has  been  shown  to  be  asymptotically  optimal  (see  ref.  3).  The  idea  here 
is to divide the grid with both horizontal and vertical separators as shown in
figure 2. Unknowns in regions 1-4 are numbered before those on separators
5-7. Each of the regions 1-4 is a square and may itself be dissected using
horizontal and vertical separators. Thus the idea may be applied recursively
and, in the case n = 2^k - 1, nested dissection will terminate after k-1 steps.
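
The step count for n = 2^k - 1 can be checked with a short Python sketch. It assumes that each step removes one grid line as a separator, leaving subregions with (n-1)/2 points per side, and that the recursion stops when a subregion is a single grid point. The function name dissection_steps is hypothetical and is not taken from the references.

```python
def dissection_steps(n):
    """Count nested dissection steps for an n by n grid with n = 2**k - 1.

    Assumption: a step splits a side of n points into two sides of
    (n - 1) / 2 points, and recursion stops at a single grid point.
    """
    steps = 0
    while n > 1:
        if n % 2 == 0:
            raise ValueError("n must be of the form 2**k - 1")
        n = (n - 1) // 2  # the separator uses up one grid line
        steps += 1
    return steps


# Tabulate (k, steps) for the first few grid sizes n = 2**k - 1.
table = [(k, dissection_steps(2**k - 1)) for k in range(1, 6)]
print(table)
```

Under these assumptions the count is k-1 for every k tried, in agreement with the statement above.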

Although  both  dissection  orderings  were  analyzed  in  reference  1,  only 
nested  dissection  will  be  discussed  further  here  because  it  is  a more  important 
algorithm and the generation of its timing formula was a much more formidable
task.

The  nested  dissection  algorithm  is  nontrivial  to  describe  in  detail.  It 
was  first  developed  and  analyzed  with  scalar  computers  in  mind  by  A.  George  in 
the  early  1970's.  The  first  attempts  at  obtaining  a timing  formula  were  done 
by  hand  and  only  gave  a description  of  the  asymptotic  behavior,  O(n^3).  Later,
the  first  few  terms  were  generated  by  hand.  Then  in  reference  3,  A.  George 
obtained  the  entire  formula  with  the  aid  of  ALTRAN. 


VECTOR  COMPUTERS 


The  existence  of  vector  computers,  i.e.,  computers  with  hardware  instruc- 
tions that  operate  on  vectors  rather  than  scalars,  raises  the  question  of  how 
effective  the  dissection  techniques  are  on  this  rather  new  class  of  computers. 
It  is  assumed  that  these  computers  have  basic  vector  instruction  execution 
times  which  are  of  the  form 

T*(j) = S* + j P* ,    (2)

where T*(j) is the total time for the vector instruction *; S* is an overhead
time, called "start-up" time; P* is the "per-result" time of that
instruction; and j is the length of the vector.

The  large  value  of  S*/P*  on  currently  available  vector  computers  implies 
that  one  pays  a significant  penalty  for  operation  on  short  vectors;  consequent- 
ly, one  would  prefer  algorithms  which  permit  the  longest  possible  vectors  (see 
ref.  4).  However,  both  of  the  dissection  algorithms  work  by  repeated  subdivi- 
sion of  the  grid  until  a minimum  operation  count  is  obtained.  It  is  this 
apparent  conflict  between  the  cost  of  using  shorter  vectors  and  the  correspond- 
ing lower  operation  counts  that  was  studied  in  reference  1. 
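
The penalty for short vectors follows directly from the timing model in eq. (2) and can be illustrated with a small Python sketch. The start-up and per-result values used here are hypothetical and do not describe any particular machine; the function names are likewise illustrative.

```python
def vector_time(start_up, per_result, length):
    """Time of one vector instruction under the model T(j) = S + j*P."""
    return start_up + length * per_result


def split_time(start_up, per_result, length, pieces):
    """Time to produce `length` results as `pieces` equal, shorter vectors."""
    if length % pieces:
        raise ValueError("length must divide evenly into pieces")
    return pieces * vector_time(start_up, per_result, length // pieces)


# Hypothetical machine: one start-up costs as much as 100 results.
S, P = 100, 1
one_long = vector_time(S, P, 1000)       # a single vector of length 1000
ten_short = split_time(S, P, 1000, 10)   # ten vectors of length 100
print(one_long, ten_short)
```

With these values the same 1000 results cost 1100 time units as one vector and 2000 as ten vectors. A lower operation count has to outweigh this kind of overhead before subdivision pays off.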


GENERATION  OF  FORMULAS


In  reference  1,  George,  Poole  and  Voigt  were  interested  in  obtaining 
parameterized  versions  of  the  timing  formulas  for  the  dissection  algorithms  on 
vector  computers.  Such  formulas  were  needed  in  order  to  study  the  effects  of 
varying  several  parameters.  They  identified  nine  parameters  characterizing  the 
vector computers: 3 start-up times for vector addition, multiplication, and
inner product; 3 per-result times for the same instructions; and times for
3 scalar operations. Furthermore, there was a parameter, n, related to the
problem size and another, ℓ, related to the algorithm which the user could vary
at liberty. The goal was to choose ℓ so as to minimize the timing formula for
a given  set  of  computer  parameters  and  a given  problem  size.  Obtaining  the 
timing  formulas  was  useful  in  several  ways: 

(1)  With  the  formulas  in  hand,  one  could  study  the  effects  of  chang- 
ing values  for  the  parameters.  In  a hypothetical  sense  one 
could  try  to  optimize  subject  to  certain  side  constraints.  In  a 
very  practical  sense,  manufacturers  announced  changes  in  the 
parameter  values  several  times; 

(2)  There  are  several  options  in  the  implementation  of  the  dissec- 
tion algorithms.  For  example,  one  can  use  a vector  inner 
product  or  a vector  "outer  product"  version  (see  ref.  1).  The 
choice  reduces  to  comparing  the  time  required  for  a vector  inner 
product  versus  a vector  addition  plus  a vector  multiplication.
Timing  formulas  permitted  analysis  of  such  options; 

(3)  Considerable  insight  into  the  vectorization  of  algorithms  was
gained.  For  example,  average  vector  lengths  could  be  studied; 

(4)  Without  the  formula,  a table  of  timing  values  for  particular 
choices  of  the  parameters  could  be  generated  by  executing  a 
model  of  the  algorithm.  However,  the  coefficients  in  the  formu- 
las could  not  be  generated. 
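
The comparison in item (2) can be sketched in Python under the timing model of eq. (2). Both versions are compared here at the same vector length, which is a simplifying assumption. The parameter names (s_inner, p_inner and so on) and the numerical values are hypothetical.

```python
def vector_time(start_up, per_result, length):
    """Time of one vector instruction under the model T(j) = S + j*P."""
    return start_up + length * per_result


def inner_product_is_faster(length, s_inner, p_inner,
                            s_add, p_add, s_mul, p_mul):
    """True if one inner product beats a vector add plus a vector multiply."""
    inner = vector_time(s_inner, p_inner, length)
    outer = (vector_time(s_add, p_add, length)
             + vector_time(s_mul, p_mul, length))
    return inner < outer


# Hypothetical machine: costly inner-product start-up, cheap per-result time.
machine = dict(s_inner=300, p_inner=1, s_add=100, p_add=1, s_mul=100, p_mul=1)
short_case = inner_product_is_faster(10, **machine)
long_case = inner_product_is_faster(200, **machine)
print(short_case, long_case)
```

For this hypothetical machine the add-plus-multiply version is faster at length 10 and the inner product is faster at length 200, with the two tied at length 100. A parameterized timing formula allows the same question to be answered for the whole algorithm rather than for a single instruction.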


The  nested  dissection  timing  formula  was  generated  in  the  following  manner. 
The  execution  of  the  nested  dissection  algorithm  was  simulated  in  a top-down 
fashion.  The  top  level,  level  1,  involved  several  summations  of  which 


\sum_{i=1}^{j-1} (2^i - 2)^2 \, \theta\!\left( \frac{n - 2^i + 1}{2^i}, \; \frac{4(n+1)}{2^i}, \; 4 \right)    (3)


is typical, where θ is a procedure at the second level. Each of the second
level  procedures  called  several  third  level  procedures,  e.g., 

THETA(Q,P,K)  :=  CHLSKY(Q)  + P LOWSOL(Q)  + MODNES(Q,P,K)  . (4)

CHLSKY,  LOWSOL  and  MODNES  are  three  of  the  third  level  procedures  defined  to  be 
the timing formulas for simple numerical computations, e.g.,

CHLSKY(Q) := (PA + PM) Q^3/6 + (SA + SM + PM) Q^2/2
             + (DSR + SM/2 - SA/2 - 2 PM/3 - PA/6) Q - SM      (5)

is the timing formula for the factorization of a dense linear system. These
third  level  procedures  were  formulas  for  factorization,  lower  solve  and  upper 
solve  of  dense  systems  and  banded  systems  and  matrix  modifications  of  the  form 

A ← A - U V W^T  . (6)

Finally,  the  bottom  level  consisted  of  the  parameters  which  characterize  the 
vector  computer.  E.g., 


SA  + Q PA  (7) 

is  the  time  for  a vector  add  of  length  Q. 
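
A timing polynomial such as CHLSKY can be checked against a direct count of instruction times. The Python sketch below does this for the polynomial (PA + PM) Q^3/6 + (SA + SM + PM) Q^2/2 + (DSR + SM/2 - SA/2 - 2 PM/3 - PA/6) Q - SM. The loop organization in chlsky_model is an assumption made for this illustration and is not taken from reference 1. It charges one scalar operation of cost DSR per column, a vector multiply to scale the column below the pivot, and a vector multiply plus a vector add for each remaining column. The function names and parameter values are hypothetical.

```python
from fractions import Fraction


def chlsky_formula(q, sa, pa, sm, pm, dsr):
    """Closed-form timing polynomial for dense factorization of order q."""
    q, sa, pa, sm, pm, dsr = (Fraction(v) for v in (q, sa, pa, sm, pm, dsr))
    return ((pa + pm) * q**3 / 6
            + (sa + sm + pm) * q**2 / 2
            + (dsr + sm / 2 - sa / 2 - 2 * pm / 3 - pa / 6) * q
            - sm)


def chlsky_model(q, sa, pa, sm, pm, dsr):
    """Add up instruction times for an assumed outer-product factorization."""
    sa, pa, sm, pm, dsr = (Fraction(v) for v in (sa, pa, sm, pm, dsr))
    total = Fraction(0)
    for k in range(1, q + 1):
        total += dsr                        # scalar work on the pivot
        if q - k > 0:
            total += sm + (q - k) * pm      # scale the column below the pivot
        for j in range(k + 1, q + 1):
            length = q - j + 1              # remaining part of column j
            total += sm + length * pm       # vector multiply
            total += sa + length * pa       # vector add
    return total


# Hypothetical parameter values, chosen distinct so that errors do not cancel.
params = dict(sa=7, pa=1, sm=11, pm=2, dsr=13)
agree = all(chlsky_model(q, **params) == chlsky_formula(q, **params)
            for q in range(1, 13))
print(agree)
```

Exact rational arithmetic is used so that agreement between the model and the polynomial is not blurred by rounding.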

The  second  and  third  levels  each  consisted  of  10  to  15  modules  and  level 
4 consisted  of  9 instruction  parameters,  1 parameter  related  to  the  algo- 
rithm and  1 related  to  the  grid  size  for  the  problem.  The  top  level  module 
contained  several  MACSYMA  sums  of  the  form 

SUM(''(EV(((2^I-2)^2)*(THETA((N-2^I+1)/(2^I),4*(N+1)/(2^I),4)),
          EXPAND)),I,1,J-1)  .                                      (8)

This is the MACSYMA form of the sum in eq. (3). The entire generated formula
consists of over 200 terms and can be found in Appendix B of reference 1. The
formula was checked by evaluating it for several sets of parameter values and
comparing the results to execution times of a FORTRAN simulation of the
algorithm. The one-way dissection formula was generated in a similar, but much
more straightforward, manner.
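
The layered structure of the top-level sum in eqs. (3) and (8) can be sketched in Python. The second-level procedure is passed in as a function, because LOWSOL and MODNES are not reproduced in this paper. The name level1_sum and the stand-in theta functions are hypothetical, and the reading of the terms as (2^i - 2)^2 and 2^i is an assumption about the exponents.

```python
from fractions import Fraction


def level1_sum(n, j, theta):
    """Evaluate sum_{i=1}^{j-1} (2^i - 2)^2 * theta(q_i, p_i, 4).

    q_i = (n - 2^i + 1) / 2^i and p_i = 4 (n + 1) / 2^i, kept exact
    as rationals. `theta` stands in for the second-level procedure.
    """
    total = Fraction(0)
    for i in range(1, j):
        q = Fraction(n - 2**i + 1, 2**i)
        p = Fraction(4 * (n + 1), 2**i)
        total += (2**i - 2)**2 * theta(q, p, 4)
    return total


# Stand-in second-level procedures, used only to exercise the sum.
weights_only = level1_sum(15, 4, lambda q, p, k: 1)
with_q = level1_sum(15, 4, lambda q, p, k: q)
print(weights_only, with_q)
```

Replacing the stand-in theta with a function built from the third-level timing formulas gives a numerical evaluation of one top-level sum, which is the kind of value that can be compared with a simulation.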


CONCLUDING  REMARKS 


MACSYMA  has  been  shown  to  be  of  considerable  value  in  the  study  of  the 
performance  of  the  nested  dissection  algorithm  when  used  on  hypothetical  vector 
computers.  The  derived  timing  formulas  lead  to  an  understanding  of  the  effects 
of  varying  the  parameters  which  characterize  the  computers.  Options  in  the 
algorithm's  implementation  can  be  studied  as  well  as  the  extent  to  which  the 
algorithm  vectorizes. 


FIGURE  1.  - ONE-WAY  DISSECTION  WITH  ORDERING  OF
UNKNOWNS  INDICATED  BY  NUMBERS  (m  = 3).


FIGURE  2.  - ONE  STEP  OF  NESTED  DISSECTION  WITH 
ORDERING  OF  UNKNOWNS  INDICATED  BY  NUMBERS. 